HW3: Wikipedia Clustering
Authors
Abstract
Overview: Our goal for this assignment was to recreate Wikipedia’s grouping of articles into semantic categories by means of clustering. Given Wikipedia’s XML dump as input, we designed a pipeline of MapReduce jobs to cluster the articles. We then used Wikipedia’s category groupings as ground truth and measured how well our clusters align with it. We have implemented all stages of this pipeline and include working code with our submission. Unfortunately, we were unable to run our scripts to completion because of challenges with cluster availability, so we do not include final results.
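As an illustration of the evaluation step described above, the following is a minimal sketch of one common way to score clusters against ground-truth categories, namely cluster purity. The abstract does not say which alignment measure the authors used, so the choice of purity, the function name `purity`, and the toy article-to-cluster and article-to-category mappings are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def purity(cluster_of, category_of):
    """Return the fraction of articles that fall in the majority
    ground-truth category of their assigned cluster.

    cluster_of:  dict mapping article title -> cluster id (hypothetical input)
    category_of: dict mapping article title -> ground-truth category
    """
    if not cluster_of:
        raise ValueError("no articles to evaluate")

    # Collect the ground-truth categories of the articles in each cluster.
    members = defaultdict(list)
    for article, cluster in cluster_of.items():
        members[cluster].append(category_of[article])

    # For each cluster, count the articles in its most frequent category.
    majority_total = sum(
        Counter(categories).most_common(1)[0][1]
        for categories in members.values()
    )
    return majority_total / len(cluster_of)


# Toy data, invented for illustration only.
cluster_of = {
    "Paris": 0,
    "Berlin": 0,
    "Python (programming language)": 0,
    "Java (programming language)": 1,
    "Haskell": 1,
}
category_of = {
    "Paris": "Geography",
    "Berlin": "Geography",
    "Python (programming language)": "Computing",
    "Java (programming language)": "Computing",
    "Haskell": "Computing",
}

score = purity(cluster_of, category_of)
print(f"purity = {score:.2f}")
```

In this toy case, four of the five articles sit in the majority category of their cluster. Purity rewards many small clusters, so a real evaluation would likely pair it with a measure that also penalizes over-splitting.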
Similar resources
Categorization of Wikipedia Articles with Spectral Clustering
The article reports the application of clustering algorithms to create hierarchical groups of Wikipedia articles. We evaluate three spectral clustering algorithms on datasets constructed using Wikipedia categories. The selected algorithm has been implemented in a system that categorizes Wikipedia search results on the fly.
Clustering Document with Active Learning using Wikipedia
Wikipedia has been applied as a background knowledge base to various text mining problems, including document categorization, topic indexing and information extraction. However, very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit Wikipedia and the semantic knowledge therein to facilitate clustering, enabling the automatic grouping of docum...
Clustering of Wikipedia Pages on Edit Behaviors
We consider the edit history of Wikipedia to perform clustering of the pages. We conjecture that the editors exhibit homophily or high correlation (in terms of their topics of interest). Therefore, it is possible to utilize the edit history to cluster pages having the same or closely related topics. We validate our clustering results with the list of categories and the incoming and outgoing links on...
Conceptual Hierarchical Clustering of Documents using Wikipedia knowledge
In this paper, we propose a novel method for conceptual hierarchical clustering of documents using knowledge extracted from Wikipedia. A robust and compact document representation is built in real-time using the Wikipedia API. The clustering process is hierarchical and creates cluster labels which are descriptive and important for the examined corpus. Experiments show that the proposed techniqu...
CS294-1 A3: Large-scale Clustering
In this project, we are given the task of clustering Wikipedia articles. As the data size is relatively large and cannot be memory-resident on a single-node computer, we first adopt a map-reduce dataflow to extract the word counts and build feature matrices. Given the compact representation of feature matrix, the clustering task is also computationally challenging due to the large number (tens of m...
Publication date: 2012